Early Web Datasets

Various datasets created for research, scholarly, or general use, generated from the "early web" era (1996-1999) of the Internet Archive's global web archive collection.

This dataset consists of multi-language URLs of the early web (1996-1999), grouped by common URL patterns. “Parallel language” refers to the same text represented in different languages; multi-language text from websites is a rich source for parallel language corpora and can be valuable in work such as machine translation. The URLs and patterns were derived and extracted from the Internet Archive’s Wayback Machine web archive, for all archived webpages successfully captured before year...
This dataset consists of metadata records identifying the language of webpages from the early web (1996-1999) portion of the Internet Archive’s web archive collection. The metadata includes capture information for each URL along with the top three language annotations for each page, as detected by Google’s Compact Language Detector (CLD3), a neural network model for language identification. Confidence scores are computed according to Internet Archive’s own Square-Leaf-Model,...